Reliable and Locality Driven Scheduling in Hadoop

Authors

  • Tran Anh Phuong
  • Manuel Antunes Veiga
  • Eduardo Teixeira Rodrigues
  • David Manuel Martins de Matos
Abstract

The increasing use of computing resources in our daily lives leads to data being generated at an unprecedented rate. The computing industry is repeatedly questioned about its ability to accommodate the unpredictable growth rate of data, and its ability to process it. This has encouraged the development of cluster-based data-intensive applications. Hadoop is a popular open-source framework known for its massive cluster-based data processing power. Hadoop is widely used in the computing industry because of its scalability, reliability, ease of use, and low cost of implementation. In recent years, cloud computing has gained increasing popularity as a cost-efficient and flexible way to leverage the power of commodity hardware. Hadoop-based services on the Cloud have also emerged as one of the prominent choices for smaller businesses. However, evidence in the literature shows that faults on the Cloud do occur and normally result in performance problems. Hadoop hides the complexity of discovering and handling failures from the schedulers, but the expense of failure recovery rests entirely on users, regardless of root causes. We systematically assess these expenses through a set of experiments, and argue that more effort to reduce this cost to users is desirable. We also analyze the drawback of Hadoop's current mechanism for prioritizing failed tasks. By trying to launch failed tasks as soon as possible regardless of locality, it significantly increases the execution time of jobs with failed tasks, for two reasons: 1) available slots might not be freed up as quickly as expected, and 2) the slots might belong to machines holding none of the relevant data, introducing extra cost for transferring data over the network, which is normally the scarcest resource in today's data centers. This thesis then introduces a new algorithmic approach called waste-free preemption.
Waste-free preemption saves the Hadoop scheduler from choosing solely between kill, which instantly releases slots but is wasteful, and wait, which wastes no previous effort but suffers from the two reasons above. With this new option, a preemptive version of Hadoop's default scheduler (FIFO) is implemented. The evaluation demonstrates the effectiveness of the new feature by comparing its performance with the traditional Hadoop mechanism.
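The trade-off between the three failure-handling policies can be sketched as follows. This is a minimal illustration, not the thesis implementation; all names (`RunningTask`, `handle_failure`) are hypothetical, and "progress" stands in for whatever partial work a kill would discard.

```python
class RunningTask:
    """A task currently occupying a slot, with some fraction of its work done."""
    def __init__(self, progress):
        self.progress = progress  # fraction of work completed, in [0, 1]

def handle_failure(policy, occupant):
    """Return (slot_freed_now, work_wasted) when a failed task needs the slot
    that `occupant` currently holds, under each policy from the abstract."""
    if policy == "kill":
        # Slot is released immediately, but the occupant's progress is lost.
        return True, occupant.progress
    if policy == "wait":
        # No work is wasted, but the failed task must wait for a slot to free
        # up naturally -- possibly on a node far from its input data.
        return False, 0.0
    if policy == "waste-free":
        # The occupant is preempted with its progress preserved, so the slot
        # is freed now and no completed work is discarded.
        return True, 0.0
    raise ValueError(f"unknown policy: {policy}")
```

For a task that is 70% complete, kill frees the slot at the cost of 0.7 units of wasted work, wait frees nothing immediately, and waste-free preemption frees the slot with zero waste, which is the combination the thesis targets.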


Similar articles

Hadoop Scheduling Base On Data Locality

In Hadoop, job scheduling is an independent module; users can design their own job schedulers based on their actual application requirements, thereby meeting their specific business needs. Currently, Hadoop has three schedulers: FIFO, computing capacity scheduling, and fair scheduling, all of which adopt task allocation strategies that consider data locality only superficially. They neither suppor...


Maximizing Data Locality in Hadoop Clusters via Controlled Reduce Task Scheduling

The overall goal of this project is to gain hands-on experience working on a large open-ended research-oriented project using the Hadoop framework. Hadoop is an open-source implementation of MapReduce and the Google File System, and is currently enjoying wide popularity. Students will modify the task scheduler of Hadoop, conduct several experimental studies, and analyze performance and netwo...


Shareability and Locality Aware Scheduling Algorithm in Hadoop for Mobile Cloud Computing

Using different scheduling algorithms can affect the performance of mobile cloud computing built on the Hadoop MapReduce framework. In the Hadoop MapReduce framework, the default scheduling algorithm is First-In-First-Out (FIFO). However, the FIFO scheduler simply schedules tasks according to their arrival time and does not consider any other factors that may have a great impact on system performance. As a res...


Locality Aware Fair Scheduling for Hammr

Hammr is a distributed execution engine for data-parallel applications modeled after Dryad. In this report, we present a locality-aware fair scheduler for Hammr. We have developed functionality to support hierarchical scheduling, preemption, and weighted users, and a minimum-flow-based algorithm to maximize task preference. For evaluation, we ran Hammr on the Hadoop Distributed File System on Amazo...


An adaptive scheduling algorithm for dynamic heterogeneous Hadoop systems

The MapReduce and Hadoop frameworks were designed to support efficient large-scale computations. There has been growing interest in employing Hadoop clusters for diverse applications. A large number of (heterogeneous) clients using the same Hadoop cluster can result in tensions between the various performance metrics by which such systems are measured. On the one hand, from the servic...



Publication date: 2014